Effect of lossy compression of quality scores on variant calling

Authors

  • Idoia Ochoa
  • Mikel Hernaez
  • Rachel L. Goldfeder
  • Tsachy Weissman
  • Euan Ashley
Abstract

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant caller is, both for the original data and for data that was previously lossily compressed. We show that lossy compression can significantly reduce storage requirements while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
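
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each variant call is represented as a (chromosome, position, ref, alt) tuple and that a gold-standard call set is available. The function name `score_calls`, the `Metrics` type, and the toy call sets are hypothetical and are not taken from the paper; this is not the authors' evaluation pipeline.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical representation of one call: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    sensitivity: float
    precision: float
    f_score: float


def score_calls(calls: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Score a call set against a gold-standard set by exact matching."""
    tp = len(calls & truth)   # called and present in the gold standard
    fp = len(calls - truth)   # called but absent from the gold standard
    fn = len(truth - calls)   # in the gold standard but not called
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + precision
    f_score = 2 * sensitivity * precision / denom if denom else 0.0
    return Metrics(sensitivity, precision, f_score)


# Toy data for illustration only; these are not results from the paper.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 90, "T", "C"),
}
original_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr3", 5, "A", "T"),
}
lossy_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
}

original_metrics = score_calls(original_calls, truth)
lossy_metrics = score_calls(lossy_calls, truth)
```

Scoring both call sets against the same gold standard puts the original and lossily compressed inputs on a common scale. Exact tuple matching is a simplification; real evaluations typically normalize variant representations before comparing them.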

Similar resources

CALQ: compression of quality values of aligned sequencing data.

Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For...

Adaptive reference-free compression of sequence quality scores

MOTIVATION Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, ...

Lossy compression of quality scores in genomic data

MOTIVATION Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confid...

ON A LOSSY IMAGE COMPRESSION/RECONSTRUCTION METHOD BASED ON FUZZY RELATIONAL EQUATIONS

The pioneer work of image compression/reconstruction based on fuzzy relational equations (ICF) and the related works are introduced. The ICF regards an original image as a fuzzy relation by embedding the brightness level into [0,1]. The compression/reconstruction of ICF correspond to the composition/solving inverse problem formulated on fuzzy relational equations. Optimizations of ICF can be consequ...

Image compression through intelligent removal and coding of image information, with reconstruction using image inpainting algorithms

Compression can be done by lossy or lossless methods. Lossy methods have been used more widely than lossless ones. Although many methods for image compression have been proposed, methods that use intelligent skipping suited to visual models have not been considered in the literature. Image inpainting refers to the application of sophisticated algorithms to replace lost o...

Journal title:
  • Briefings in bioinformatics

Volume 18, Issue 2

Pages: -

Publication date: 2017